feat: Add random state feature. #150
base: john-development
Conversation
john-halloran commented Jun 6, 2025
- feat: Added random_state feature for reproducibility.
This is great!
We have to decide how much testing to add. The ideal is 100% coverage; the optimum is probably less.
Maybe write the docstrings so I can understand what the class does, and then we can decide what to test?
        MM,
        Y0=None,
        X0=None,
        A=None,
more descriptive name?
There are many different standards for what to name these matrices, and essentially zero agreement between sources that use NMF. I'm inclined to eventually use what sklearn.decomposition.non_negative_factorization uses, which would mean MM->X, X->W, Y->H. But I'd like to leave this as is for the moment until there's a consensus about what would be the most clear or standard. If people will be finding this tool from the sNMF paper, there's also an argument for using the X, Y, and A names, because those were used there.
OK, sounds good. It has to be a very good reason to break PEP8. The only good enough reason I can think of is to be consistent with scikit-learn. Another way of saying it is that we can "adopt the scikit-learn standard".
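To make the naming question above concrete, here is a minimal sketch of how the matrices would map onto the scikit-learn convention, where the data matrix X is factored as X ≈ W @ H. The dimension names and `NAME_MAP` are illustrative, not part of the PR:

```python
import numpy as np

# Current sNMF names mapped onto the scikit-learn-style names discussed above.
NAME_MAP = {"MM": "X", "X": "W", "Y": "H"}

rng = np.random.default_rng(0)
n_signal, n_conditions, n_components = 50, 10, 3

W = rng.random((n_signal, n_components))      # sNMF's X: the components
H = rng.random((n_components, n_conditions))  # sNMF's Y: the weights
X = W @ H                                     # sNMF's MM: the data matrix

# Whichever naming standard wins, the shapes stay consistent.
assert X.shape == (n_signal, n_conditions)
```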
@@ -4,8 +4,20 @@

class SNMFOptimizer:
    def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500, tol=5e-7, components=None):
We need a docstring here and in the init. Please see the scikit-package FAQ about how to write these. Also, look at Yucong's code or diffpy.utils?
Added one here. The package init dates back to the old codebase, but as soon as that is updated it will get a docstring as well.
The package init (i.e., the __init__.py) doesn't need a docstring.
@@ -15,23 +27,22 @@ def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500
        # Capture matrix dimensions
        self.N, self.M = MM.shape
        self.num_updates = 0
        self.rng = np.random.default_rng(random_state)
can we have a more descriptive variable name? Is this a range? What is the range?
ping on this one.
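For context on the naming question: `np.random.default_rng` returns a `numpy.random.Generator`, not a range, so a more descriptive attribute name could resolve the reviewer's confusion. A minimal sketch (the helper name `make_generator` is hypothetical):

```python
import numpy as np

def make_generator(random_state=None):
    # np.random.default_rng returns a numpy.random.Generator; a name like
    # "random_generator" (rather than "rng") would make clear this is a
    # random-number generator, not a range.
    return np.random.default_rng(random_state)

# The same seed reproduces the same draws, which is the point of the PR:
a = make_generator(42).normal(0, 1e-3, size=3)
b = make_generator(42).normal(0, 1e-3, size=3)
assert np.allclose(a, b)
```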
        if self.A is None:
-           self.A = np.ones((self.K, self.M)) + np.random.randn(self.K, self.M) * 1e-3  # Small perturbation
+           self.A = np.ones((self.K, self.M)) + self.rng.normal(0, 1e-3, size=(self.K, self.M))
K and M are probably good names if the matrix decomposition equation is in the docstring, so they get defined there.
I think you addressed this with your comment on MM, but as a general rule, please respond to each comment so the reviewer knows you have seen it. It wouldn't work here, but just a thumbs up works if you have seen a comment and agree; it saves time in the long run as I don't have to write this long comment...... :)
Thanks, will work on resolving these. To be clear, for things like the docstrings, would you prefer I make new PRs, get those merged, and then rebase this one, or just add to this existing PR?
For now, I will assume anything given as feedback in this PR could be in scope to include.
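The diff above swaps the module-level `np.random.randn` call for the per-instance seeded generator. A minimal sketch of why this matters for reproducibility, assuming K components and M conditions:

```python
import numpy as np

K, M = 3, 10      # assumed number of components and conditions
random_state = 0  # the new seed parameter added in this PR

# Old form: module-level np.random.randn draws from global NumPy state,
# so the initial A changes run to run unless the global seed is set.
A_old = np.ones((K, M)) + np.random.randn(K, M) * 1e-3

# New form: a per-instance Generator seeded by random_state.
rng = np.random.default_rng(random_state)
A_new = np.ones((K, M)) + rng.normal(0, 1e-3, size=(K, M))

# Re-creating the generator with the same seed reproduces A exactly.
rng2 = np.random.default_rng(random_state)
A_again = np.ones((K, M)) + rng2.normal(0, 1e-3, size=(K, M))
assert np.array_equal(A_new, A_again)
```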
This is a great start. I left a couple of comments.
src/diffpy/snmf/snmf_class.py
@@ -17,6 +17,64 @@ def __init__(
        components=None,
        random_state=None,
    ):
        """Run sNMF based on an ndarray, parameters, and either a number
This is fantastic! Thanks for this. Please see here for our docstring standards, I am not sure if you looked at it:
https://scikit-package.github.io/scikit-package/programming-guides/billinge-group-standards.html#docstrings
For classes it is a bit tricky because we have to decide what info goes in the "class" docstring and what goes in the "constructor" (i.e., the __init__()) docstring. After some googling we came up with the breakout shown in the DiffractionObjects class there. We would be after something similar here.
By way of example, I would probably do it like this in this case:
class SNMFOptimizer:
    '''Configuration and methods to run the stretched NMF algorithm, sNMF

    Instantiating the SNMFOptimizer class runs all the analysis
    immediately. The results can then be accessed as instance attributes
    of the class (X, Y, and A).

    Please see <reference to paper> for more details about the algorithm.

    Attributes
    ----------
    mm : ndarray
        The array containing the data to be decomposed. Shape is
        (length_of_signal, number_of_conditions).
    y0 : ndarray
        The array containing initial guesses for the component weights
        at each stretching condition. Shape is
        (number_of_components, number_of_conditions).
    ...
    '''
put future development plans into issues, not in the docstring. Just describe the current behavior. Try and keep it brief but highly informational.
To conform to PEP8 standards I lower-cased the variables. I know they correspond to matrices, but we should decide which standard to break. The tie-breaker should probably be scikit-learn: whatever they do, let's do that. Let's also add a small comment (not in the docstring) to remind ourselves in the future if it breaks PEP8, or it will annoy me every time we revisit it and I will try to change it back......
Conditions on instantiation will go in the constructor docstring.
That one describes the init method so should look more like a function docstring. It would look something like....
def __init__(self, mm, ...):
    '''Initialize a SNMFOptimizer instance and run the optimization

    Parameters
    ----------
    mm : ndarray
        The array containing the data to be decomposed. Shape is
        (length_of_signal, number_of_conditions).
    y0 : ndarray, optional. Defaults to None.
        The array containing initial guesses for the component weights
        at each stretching condition. Shape is
        (number_of_components, number_of_conditions).
    ...
    '''
I think there was some text before about how Y0 was required. But if it is required, it may be better to make it a required (positional) argument in the constructor and not have it optional. We can discuss design decisions too if you like.
Either Y0 or n_components needs to be provided. Currently, Y0.shape overrides n_components if both are provided, and an error is thrown if neither is provided. scikit-learn is a little more flexible and also allows an n_components that differs from Y0.shape, although I'm not clear on why you'd want that. But I'm not matching their behavior exactly because the current code doesn't allow that.
scikit-learn actually does break PEP8 to upper-case the matrices.
good progress, please see comments.
@@ -4,6 +4,18 @@

class SNMFOptimizer:
    """A self-contained implementation of the stretched NMF algorithm (sNMF),
This is too long. Needs to be < 80 characters, followed by a blank line.
    For more information on sNMF, please reference:
    Gu, R., Rakita, Y., Lan, L. et al. Stretched non-negative matrix factorization.
    npj Comput Mater 10, 193 (2024). https://doi.org/10.1038/s41524-024-01377-5
We would normally do a list of class attributes here: everything that is self.something. This obviously overlaps strongly with the arguments of the constructor, as many of the attributes get defined there, but logically they are different. Here we list and describe the class attributes; there we describe the init function arguments.
    of the class (X, Y, and A). Eventually, this will be changed such
    that __init__ only prepares for the optimization, which can then
    be done using fit_transform.
    """Initialize an instance of SNMF and run the optimization

    Parameters
    ----------
    MM: ndarray
these need a space before the colon (not sure why we adopted that standard, but we did). So mm : ndarray
        provided.
        The array containing initial guesses for the component weights
        at each stretching condition. Shape is (number of components, number of
        conditions). Must be provided if n_components is not provided. Will override
normally we would raise an exception if two conflicting things are provided (we don't want to guess which is the right one) unless there is a good functional reason to do it another way. We like to avoid "magic" and the current behavior of the code could be "magic". Please raise an exception unless there is a strong reason to do otherwise.
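A minimal sketch of the validation the reviewer is asking for, raising on missing or conflicting inputs rather than silently letting Y0 win. The helper name `resolve_n_components` is hypothetical, not part of the PR:

```python
import numpy as np

def resolve_n_components(Y0=None, n_components=None):
    # Hypothetical helper illustrating the requested behavior: raise an
    # exception on missing or conflicting inputs instead of "magic".
    if Y0 is None and n_components is None:
        raise ValueError("Provide either Y0 or n_components.")
    if Y0 is not None:
        k = Y0.shape[0]
        if n_components is not None and n_components != k:
            raise ValueError(
                f"n_components={n_components} conflicts with "
                f"Y0.shape[0]={k}; provide consistent values."
            )
        return k
    return n_components

assert resolve_n_components(Y0=np.ones((3, 10))) == 3
assert resolve_n_components(n_components=4) == 4
```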
        A stretching factor that influences the decomposition. Zero corresponds to
        no stretching present. Relatively insensitive and typically adjusted in
        powers of 10.
        The float which sets a stretching factor that influences the decomposition.
We don't need to say the type here (float) as it is given above. We can just say "The stretching factor...". The same is actually true above, too: instead of "the array containing initial guesses", it usually works just as well as "The initial guesses..."
        The maximum number of times to update each of A, X, and Y before stopping
        the optimization.
    tol: float
        The minimum fractional improvement in the objective function to allow
how about "The convergence threshold. This is the minimum......"
        without terminating the optimization. Note that a minimum of 20 updates
        are run before this parameter is checked.
    n_components: int
        The number of components to attempt to extract from MM. Note that this will
Attempt? So sometimes it extracts fewer than n_components when it attempts but doesn't manage?
        be overridden by Y0 if that is provided, but must be provided if no Y0 is
        provided.
    random_state: int
        The integer which acts as a reproducible seed for the initial matrices used in
"The random seed used to initialize". I think the second sentence is useful information, but I think everyone will know what this is. By the way, let's cross-check, if you didn't already, that we are using the same names for common variables as scikit-learn.